
[SPARK-23280][SQL] add map type support to ColumnVector #20450

Closed
wants to merge 4 commits into apache:master from cloud-fan:map

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Jan 31, 2018

What changes were proposed in this pull request?

Fill in the last missing piece of ColumnVector: map type support.

The idea is similar to the array type support. A map is basically two arrays: keys and values. We ask the implementations to provide a key array, a value array, and an offset and length that specify the range of this map in the key/value arrays.

In WritableColumnVector, we put the key array in the first child vector, the value array in the second child vector, and the offsets and lengths in the current vector, which is very similar to how the array type is implemented.
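For illustration only, here is a minimal sketch of that layout. It is not code from this PR; it assumes the array-style WritableColumnVector API (getChild, putArray) carries over to maps exactly as described above, and that OnHeapColumnVector creates the two child vectors for a MapType automatically.

```java
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.vectorized.ColumnarMap;

public class MapVectorSketch {
  public static void main(String[] args) {
    // One column of map<int, int> with room for 2 rows.
    WritableColumnVector maps = new OnHeapColumnVector(
        2, DataTypes.createMapType(DataTypes.IntegerType, DataTypes.IntegerType));
    WritableColumnVector keys = maps.getChild(0);    // first child: keys of all maps
    WritableColumnVector values = maps.getChild(1);  // second child: values of all maps

    // Row 0 holds [1 -> 10]; row 1 holds [2 -> 20, 3 -> 30].
    keys.putInt(0, 1);  values.putInt(0, 10);
    keys.putInt(1, 2);  values.putInt(1, 20);
    keys.putInt(2, 3);  values.putInt(2, 30);

    // The parent vector only records each row's (offset, length) into the children.
    maps.putArray(0, 0, 1);  // row 0: entries [0, 1)
    maps.putArray(1, 1, 2);  // row 1: entries [1, 3)

    // Reading back: getMap returns a ColumnarMap backed by the two child vectors.
    ColumnarMap row1 = maps.getMap(1);
    System.out.println(row1.numElements());           // 2
    System.out.println(row1.keyArray().getInt(0));    // 2
    System.out.println(row1.valueArray().getInt(0));  // 20
  }
}
```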

How was this patch tested?

a new test

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86866 has finished for PR 20450 at commit 8e66fb5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ColumnarMap extends MapData

@ueshin
Member

ueshin commented Jan 31, 2018

Should we also enable getMap() on ColumnarArray/ColumnarRow?

Member

@ueshin ueshin left a comment

LGTM except for one comment.

column.putArray(0, 0, 1)
column.putArray(1, 1, 2)
column.putNull(2)
column.putArray(3, 2, 0)
Member

column.putArray(3, 3, 0)?
Seems like line 670 is the same mistake?

Contributor Author

@cloud-fan cloud-fan Jan 31, 2018

The key array is: 0, 1, 2, 3, 4, 5
The value array is: 0, 2, 4, 6, 8, 10

Note that the map [1->2, 2->4] contributes 2 keys and 2 values.

Member

Yes, so the offset of the next array should be 3?

Contributor Author

Ah, I see, you are right!
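To spell out the arithmetic settled in this thread, here is a hedged sketch, written in Java for consistency with the other sketches on this page (the actual test is Scala, and the concrete rows below are reconstructed from the comments above rather than copied from the PR; it reuses the imports from the sketch near the top of this page). The first three rows consume 1 + 2 + 0 = 3 key/value entries, so the next map's offset must be 3.

```java
// Reconstructed layout for the maps discussed in this thread:
//   keys child:   0, 1, 2, 3, 4, 5
//   values child: 0, 2, 4, 6, 8, 10
WritableColumnVector column = new OnHeapColumnVector(
    5, DataTypes.createMapType(DataTypes.IntegerType, DataTypes.IntegerType));
WritableColumnVector keys = column.getChild(0);
WritableColumnVector values = column.getChild(1);
for (int i = 0; i < 6; i++) { keys.putInt(i, i); values.putInt(i, i * 2); }

// Per-row (offset, length) into the shared key/value children:
column.putArray(0, 0, 1);  // [0->0]                -> entries [0, 1)
column.putArray(1, 1, 2);  // [1->2, 2->4]          -> entries [1, 3)
column.putNull(2);         // null map
column.putArray(3, 3, 0);  // []                    -> offset 3 (= 1 + 2 + 0), length 0
column.putArray(4, 3, 3);  // [3->6, 4->8, 5->10]   -> entries [3, 6)
```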

}
}

// Populate it with maps [0->0], [1->2, 2->4], [], [3->6, 4->8, 5->10]
Member

// Populate it with maps [0->0], [1->2, 2->4], null, [], [3->6, 4->8, 5->10]?

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86868 has finished for PR 20450 at commit 2603d02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* To support map type, implementations must construct an {@link ColumnarMap} and return it in
* this method. {@link ColumnarMap} requires a {@link ColumnVector} that stores the data of all
* the keys of all the maps in this vector, and another {@link ColumnVector} that stores the data
* of all the values of all the maps in this vector, and an offset and length which specifies the
Member

nit: an offset and length -> a pair of offset and length? Or specifies -> specify?
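As a hedged illustration of the contract this javadoc describes, an implementation's getMap could look roughly like the sketch below. The field names keyData, valueData, offsets, and lengths are invented for illustration and are not from any real vector implementation.

```java
// Hypothetical implementation sketch of the getMap contract described above.
@Override
public ColumnarMap getMap(int rowId) {
  // keyData/valueData hold the keys/values of every map in this vector;
  // offsets[rowId] and lengths[rowId] delimit this row's slice of them.
  return new ColumnarMap(keyData, valueData, offsets[rowId], lengths[rowId]);
}
```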

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86872 has finished for PR 20450 at commit 48c275f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86883 has finished for PR 20450 at commit 182b404.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

retest this please

@@ -530,7 +530,7 @@ public int putByteArray(int rowId, byte[] value, int offset, int length) {
@Override
protected void reserveInternal(int newCapacity) {
int oldCapacity = (nulls == 0L) ? 0 : capacity;
if (isArray()) {
if (isArray() || type instanceof MapType) {
Contributor

nit: we may also have a method isMap().

Contributor Author

This might be overkill: isArray needs to take care of many types, but for isMap we would only accept one type: MapType.
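For context, a rough sketch of the asymmetry being described (this mirrors, but may not exactly match, the real WritableColumnVector code; the helper names here are made up, and the types come from org.apache.spark.sql.types):

```java
// Types whose values are stored as (offset, length) slices into a child vector
// all go through the array-style code path:
private static boolean isArrayLike(DataType type) {
  return type instanceof ArrayType || type instanceof BinaryType ||
      type instanceof StringType || DecimalType.isByteArrayDecimalType(type);
}

// A dedicated isMap() helper would boil down to a single check:
private static boolean isMap(DataType type) {
  return type instanceof MapType;
}
```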

*
* To support map type, implementations must construct an {@link ColumnarMap} and return it in
* this method. {@link ColumnarMap} requires a {@link ColumnVector} that stores the data of all
* the keys of all the maps in this vector, and another {@link ColumnVector} that stores the data
Contributor

nit: maps -> map entries ?

Contributor Author

"keys of map entries" sounds weird...

}

@Override
public int numElements() { return length; }
Contributor

numElements or length?

Contributor Author

@cloud-fan cloud-fan Feb 1, 2018

This is an API from the parent class; we can't change it.

@jiangxb1987
Contributor

LGTM, only some nits and naming issues.

Member

@viirya viirya left a comment

LGTM

* key array with a index and a value from the value array with the same index contribute to
* an entry of this map type value.
*
* To support map type, implementations must construct an {@link ColumnarMap} and return it in
Member

construct an -> construct a.

Contributor Author

This is very minor and may not be worth waiting for another QA round. Maybe we can fix it in your "return null" PR?

Member

Sure. LGTM.

@jiangxb1987
Contributor

LGTM

@SparkQA

SparkQA commented Feb 1, 2018

Test build #86896 has finished for PR 20450 at commit 182b404.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

Thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Feb 1, 2018
## What changes were proposed in this pull request?

Fill in the last missing piece of `ColumnVector`: map type support.

The idea is similar to the array type support. A map is basically two arrays: keys and values. We ask the implementations to provide a key array, a value array, and an offset and length that specify the range of this map in the key/value arrays.

In `WritableColumnVector`, we put the key array in the first child vector, the value array in the second child vector, and the offsets and lengths in the current vector, which is very similar to how the array type is implemented.

## How was this patch tested?

a new test

Author: Wenchen Fan <[email protected]>

Closes #20450 from cloud-fan/map.

(cherry picked from commit 52e00f7)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 52e00f7 Feb 1, 2018
@viirya
Member

viirya commented Feb 1, 2018

I found that we don't enable the getMap API in MutableColumnarRow in this change; should we do it? If so, I can make a small follow-up PR for it.
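For reference, a hedged sketch of what enabling it could look like (the real change landed in the follow-up PR referenced further down; the columns and rowId field names are assumed here, not confirmed by this page):

```java
// Hypothetical: MutableColumnarRow delegating map access to the underlying vector.
@Override
public ColumnarMap getMap(int ordinal) {
  return columns[ordinal].getMap(rowId);
}
```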

@ueshin
Member

ueshin commented Feb 1, 2018

@viirya Thanks, but I'm already working on it. I'll do it soon.

@viirya
Member

viirya commented Feb 1, 2018

@ueshin Ok. No problem. :)

asfgit pushed a commit that referenced this pull request Feb 1, 2018
## What changes were proposed in this pull request?

This is a follow-up of #20450, which broke the lint-java checks.
This PR fixes the lint-java issues.

```
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java:[20,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java:[21,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
```

## How was this patch tested?

Checked manually in my local environment.

Author: Takuya UESHIN <[email protected]>

Closes #20468 from ueshin/issues/SPARK-23280/fup1.

(cherry picked from commit 8bb70b0)
Signed-off-by: Takuya UESHIN <[email protected]>
asfgit pushed a commit that referenced this pull request Feb 1, 2018
## What changes were proposed in this pull request?

This is a follow-up PR of #20450.
We should've enabled `MutableColumnarRow.getMap()` as well.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <[email protected]>

Closes #20471 from ueshin/issues/SPARK-23280/fup2.

(cherry picked from commit 89e8d55)
Signed-off-by: Takuya UESHIN <[email protected]>